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Abstract 

This work builds upon previous efforts in online incremental learning, namely the 
Incremental Gaussian Mixture Network (IGMN). The IGMN is capable of learning 
from data streams in a single-pass by improving its model after analyzing each data 
point and discarding it thereafter. Nevertheless, it suffers from the scalability 
point-of-view, due to its asymptotic time complexity of 0(iV/\H 3 ) for N data points, 
K Gaussian components and D dimensions, rendering it inadequate for 
high-dimensional data. In this paper, we manage to reduce this complexity to 
0(NKD 2 ^ by deriving formulas for working directly with precision matrices instead of 
covariance matrices. The final result is a much faster and scalable algorithm which 
can be applied to high dimensional tasks. This is confirmed by applying the modified 
algorithm to high-dimensional classification datasets. 

1 Introduction 

The Incremental Gaussian Mixture Network (IGMN) pQ[2] is a supervised algorithm 
which approximates the EM algorithm [3|. It creates and continually adjusts a 
probabilistic model consistent to all sequentially presented data, after each data point 
presentation, and without the need to store any past data points. Its learning process 
is aggressive, meaning that only a single scan through the data is necessary to obtain a 
consistent model. 

IGMN adopts a Gaussian mixture model of distribution components that can be 
expanded to accommodate new information from an input data point, or reduced if 
spurious components are identified along the learning process. Each data point 
assimilated by the model contributes to the sequential update of the model parameters 
based on the maximization of the likelihood of the data. The parameters are updated 
through the accumulation of relevant information extracted from each data point. 

The IGMN is capable of supervised learning, simply by assigning any of its input 
vector elements as outputs (any element can be used to predict any other element, like 
autoassociative neural networks HI)- This feature is useful for simultaneous learning of 
forward and inverse kinematics, as well as for simultaneous learning of a value function 
and a policy in reinforcement learning [5]. 

However, the IGMN suffers from cubic time complexity due to matrix inversion 
operations and determinant computations. Its time complexity is of 0(iV/\Z) 3 ), where 
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N is the number of data points, K is the number of Gaussian components and D is 
the problem dimension. It makes the algorithm prohibitive for high-dimensional tasks 
(like visual tasks) and thus of limited use. One solution would be to use diagonal 
covariance matrices, but this decreases the quality of the results, as already reported 
in previous work Eld- In this work, we propose the use of rank-one updates for both 
inverse matrices and determinants applied to full covariance matrices, thus reducing 
the time complexity to O^NKD 2 ^ for learning while keeping the quality of a full 
covariance matrix solution. 

For the specific case of the IGMN algorithm, to the best of our knowledge, this has 
not been tried before, although we can find similar efforts for related algorithms. 

In [8], rank-one updates were applied to an iterated linear discriminant analysis 
algorithm in order to decrease the complexity of the algorithm. Rank-one updates 
were also used in [Sj, where Gaussian models are employed for feature selection. 
Finally, in )10l . the same kind of optimization was applied to Maximum Likelihood 
Linear Transforms (MLLT). 

The next Section describes the algorithm in more detail with the latest 
improvements to date. Section [3] describes our improvements to the algorithm. Section 
S] shows the experiments and results obtained from both versions of the IGMN for 
comparison, and Section [5] finishes this work with concluding remarks. 


2 Incremental Gaussian Mixture Network 

In the next subsections we describe the current version of the IGMN algorithm. 

2.1 Learning 

The algorithm starts with no components, which are created as necessary (see 
subsection 12.21) . Given input x (a single instantaneous data point), the IGMN 
algorithm processing step is as follows. First, the squared Mahalanobis distance 
ci 2 (x, j) for each component j is computed: 


= ( x -Mj) T CqNx-Mj) (1) 

where Hj is the j th component mean, Cj its full covariance matrix . If any d 2 (x,j ) 
is smaller than than x'jj i-p (the 1 — /3 percentile of a chi-squared distribution with D 
degrees-of-freedom, where D is the input dimensionality and j3 is a user defined 
meta-parameter, e.g., 0.1), an update will occur, and posterior probabilities are 
calculated for each component as follows: 


P( x lj) 


1 

{ 2 - k ) d / 2 -y/fCj]" ° XP 



P(j l x ) 


P(x\j)p(j) w- 

K J 

^2p(x\k)p(k) 

k=l 


( 2 ) 

(3) 


where K is the number of components. Now, parameters of the algorithm must be 
updated according to the following equations: 


Vj(t) = Vj(t — 1) + 1 


(4) 


SPj{t) = spj{t - 1) +K?|x) 


(5) 


2 /m 


PLOS 







c j(t) 


(1 


e, = x fAj (t 1) 

(6) 

P(j |x) 

LUj = - 

S Pj 

(7) 

Ap,j = Wj-ej 

(8) 

~ 1) + Anj 

(9) 

e *j = x - Hj(t) 

(10) 

- Uj)Cj{t - 1) + Wj-e*e* T - AnjAnJ 

(11) 
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where spj and Vj are the accumulator and the age of component j , respectively, 
and p(j ) is its prior probability. 


2.2 Creating New Components 

If the update condition in the previous subsection is not met, then a new component j 
is created and initialized as follows: 


/i) = x; spj = 1; vj = 1; p{j) 



= c'L 1 


J2 S Pi 
2 — 1 

where K already includes the new component and (Jini can be obtained by: 


O’ ini = Sstd(x) (13) 

where <5 is a manually chosen scaling factor (e.g., 0.01) and std is the standard 
deviation of the dataset. Note that the IGMN is an online and incremental algorithm 
and therefore it may be the case that we do not have the entire dataset to extract 
descriptive statistics. In this case the standard deviation can be just an estimation 
(e.g., based on sensor limits from a robotic platform), without impacting the algorithm. 


2.3 Removing Spurious Components 

A component j is removed whenever Vj > v m i n and spj < sp m i n , where v m i n and 
spmin are manually chosen (e.g., 5.0 and 3.0, respectively). In that case, also, p(k) 
must be adjusted for all k € K, k ^ j, using (fT^l) . In other words, each component is 
given some time v m i n to show its importance to the model in the form of an 
accumulation of its posterior probabilities spj. 
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2.4 Inference 

In the IGMN, any element can be predicted by any other element. This is done by 
reconstructing data from the target elements (x t , a slice of the entire input vector x) 
by estimating the posterior probabilities using only the given elements (x,, also a slice 
of the entire input vector x), as follows: 


p 01*0 = V ( 14 ) 

^2p{xi\q)p(q) 

9=1 

It is similar to ©, except that it uses a modified input vector x* with the target 
elements x t removed from calculations. After that, x t can be reconstructed using the 
conditional mean equation: 


M 

Xt = 5Z^’|xi)(Mj,t + C,oC ,'(x, - n jti )) (15) 

3 =1 

where Cj t ti. is the submatrix of the jth component covariance matrix associating 
the unknown and known parts of the data, Cj.i is the submatrix corresponding to the 
known part only and Hj,i is the jth’s component mean without the element 
corresponding to the target element. 


3 Fast IGMN 


One of the contributions of this work lies in the fact that Equation Q] (the squared 
Mahalanobis distance) requires a matrix inversion, which has a asymptotic time 

complexity of 0(.D 3 ), for D dimensions (0(zT° 927+ °w) for the Strassen algorithm or 
at best o(U 2 - 3728639 ) with the most recent algorithms to date [IT]). This renders the 
entire IGMN algorithm as impractical for high-dimension tasks. Here we show how to 
work directly with the inverse of covariance matrix (also called the precision or 
concentration matrix) for the entire procedure, therefore avoiding costly inversions. 

Firstly, let us denote CN 1 = A, the precision matrix. Our task is to adapt all 
equations involving C to instead use A. 

We now proceed to adapt Equation ITT1 ( covariance matrix update). This equation 
can be seen as a sequence of two rank-one updates to the C matrix, as follows: 


c j{t) = (1 - U3j)Cj(t - 1) + ujje*e* T 

C j(t) = Cj{t) - A/^A/xJ 

This allows us to apply the Sherman-Morrison formula [12] 

A 1 uv T A 1 


(A 


uvT- 1 = A' 1 - 


(16) 

(17) 

(18) 


1 + v T A 3 u 

This formula shows how to update the inverse of a matrix plus a rank-one update. 
For the second update, which subtracts, the formula becomes: 


(A — uv 


T '' -1 = A -1 + 


A 3 uv T A 1 


1 — v T A 

In the context of IGMN, we have A = (1 — uj)Cj(t — 1) = (1 — w)AJ 1 (t — 1) and 
u = v = yfijje* for the first update, while for the second one we have A = C 7 (t) and 
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u = v = A fj,j. Rewriting |T8] and [T9] we get (for the sake of compactness, assume all 
subscripts for A and A/x to be j): 


A(t — 1) l)e*e* T A(t - 1) 

1 — oj 1 + Y^e* T A(t — l)e* 


( 20 ) 


A(t) = A (t) + 


A(t)AfiAfi T A(t) 
1 — Afi T A(t)Afi 


( 21 ) 


These two equations allow us to update the precision matrix directly, eliminating 
the need for the covariance matrix C. They have 0(A 2 ) complexity due to 
matrix-vector products. 

Following on the adaptation of the IGMN equations, Equation [T] (the squared 
Mahalanobis distance) allows for a direct substituion, yielding the following new 
equation: 


d 2 M (*,j) = (x - Hj) T Aj(x - fij) (22) 

which now has a 0(fV 2 ) complexity, since there is no matrix inversion as the 
original equation. After removing the cubic complexity from this step, the 
determinant computation will be dealt with next. 

Since the determinant of the inverse of a matrix is simply the inverse of the 
determinant, it is sufficient to invert the result. But computing the determinant itself 
is also a 0(H 3 ) operation, so we will instead perform rank-one updates using the 
Matrix Determinant Lemma m, which states the following: 

|A + uv T | = |A|(l + v T A“ 1 u) (23) 

|A — uv T | = | A| (1 — v T A _1 u) (24) 

Since the IGMN covariance matrix update involves a rank-two update, adding a 
term and then subtracting one, both rules must be applied in sequence, similar to 
what has been done with the A equations. Equations ITCI and [T71 may be reused here, 
together with the same substitutions previously showed, leaving us with the following 
new equations for updating the determinant (again, j subscripts were dropped): 

|C(i)| = (1 - co) D \C(t - 1)| (l + j^e* T A(t - l)e*) (25) 

|C(t)| = |C(i)|(l — Afi T A(t)Afi) (26) 

This was the last source of cubic complexity, which is now quadratic. 

Finishing the adaptation in the learning part of the algorithm, we just need to 
define the initialization for A for each component. What previously was C ; = er 2 ni I 
now becomes Aj = er^I, the inverse of the variances of the dataset. Since this matrix 
is diagonal, there are no costly inversions involved. And for initializing the 
determinant |C|, just set it to n which again takes advantage of the initial 
diagonal matrix to avoid costly operations. Note that we keep the precision matrix A, 
but the determinant of the covariance matrix C instead. See algorithms |T] to [3] for a 
summary of the new learning algorithm (excluding pruning, for brevity). 

Finally, the inference Equation fT5l must also be updated in order to allow the 
IGMN to work in supervised mode. This can be accomplished by the use of a block 
matrix decomposition (note that here C is just another sub-matrix, not the covariance 
matrix as used before): 
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Algorithm 1 Fast IGMN Learning 

Input: <5,/3,X 

^>0, <T-\ = (6std(X))-\M = V 
for all input data vector x £ X do 

if K = 0 or 3j, d 2 M (x,j) < X 2 D ,i-p then 
update(x) 

else 

M <— M U create (x) 

end if 
end for 


Algorithm 2 update 
Input: x 

for all Gaussian component j G M do 

Compute equations Q] to [12] substituting [T| for [22] and [TT] for [20] and [21] 
Compute equations [25] and [26] 
end for 


A 

B 

-1 

X 

Y 

C 

D 


Z 

W 


(A-BD^C)- 1 —A _1 B(D — CA _1 B) _1 

_D 1 C(A-BD 1 C)- 1 (D-CA^B)- 1 

Here, according to Equation [L5l we need C and A -1 . But since the terms that 
constitute these sub-matrices are relative to the original covariance matrix (which we 
do not have), they must be extracted from the precision matrix directly. Looking at 
the decomposition, it is clear that YW -1 = — A -1 B = — CA _1 (the terms between 
parenthesis in Y and W cancel each other, while B = C T due to symmetry). So 
Equation 1151 can be rewritten as: 

M 

x t = 5>01xi)(#*j,* - YW 4 (x, - /ij.i)) (27) 

i=i 

where Y and W can be extracted directly from A. However, we still need to 
compute the inverse of W. So we can say that this particular implementation has 
O (NKD 2 ) complexity for learning and 0[NKD 3 ) for inference. The reason for us to 
not worry about that is that d = i + o, where i is the number of inputs and o is the 
number of outputs. The inverse computation acts only upon the output portion of the 
matrix. Since, in general, o <^L i (in many cases even o = 1), the impact is minimal, 
and the same applies to the YW -1 product. In fact, Weka (the data mining platform 
used in this work [14]) allows for only 1 output, leaving us with just scalar operations. 


4 Experiments 

In order to evaluate the performance of the proposed algorithm, 11 classification tasks 
(Table [l]) were given to the original and improved IGMN algorithms (d = 1 and (3 = 0, 
so a single component was created for each run and we could focus on speed ups only 
due to dimensionality). Results were obtained from 2-fold cross-validation and 
statistical significances came from paired t-tests with p = 0.05. 

This experiment was meant to verify that both IGMN implementations produce 
exactly the same results, which was confirmed, as well as to show that the proposed 
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Algorithm 3 create 

Input: x 

K <- I< + 1 

return new Gaussian component K with = x, = cr~^ I, \Ck\ = |A^| _1 , 
Spj = 1, Vj = 1, p{j) = -jf 1 — 

fe=l 


Table 1. Datasets 


Dataset 

Instances (N) 

Attributes (D) 

Classes 

breast-cancer 

286 

9 

2 

german-credit 

1000 

20 

2 

pima-diabetes 

768 

8 

2 

Glass 

214 

9 

7 

ionosphere 

351 

34 

2 

iris 

150 

4 

3 

labor-neg-data 

57 

16 

2 

soybean 

683 

35 

19 

twospirals 

193 

2 

2 

MNIST fl5l (subset) 

1000 

784 

10 

CIFAR-10 1161 (subset) 

1000 

3072 

10 

CIFAR-lOb (subset) 

100 

3072 

10 


improvements really deliver the expected speed up (the Weka packages for both 
variations of the IGMN algorithm can be found at 

http://www.inf.ufrgs.br/~rcpinto). This was also confirmed, as can be seen in 
tables [2] and [3] (note that the experiments were divided into training and test phases 
just for comparison purposes, but IGMN is in fact an online algorithm; also note that 
standard deviations were rounded to 3 decimal places and, in fact, there are not any 
null standard deviations in the results). Our improved algorithm could deliver a speed 
up of 2 orders of magnitude in training time (learning) for the CIFAR-10 subset, which 
follows our expectations according to the new time complexity. Datasets with less 
dimensions could benefit from the improvements too, albeit in a smaller scale, which 
was also expected. The other confirmation came from the testing times (inference): 
they were also improved, since the inverse matrix computation was eliminated from 
the probability density equation, which is also necessary for inference, and the matrix 
multiplications and inversions are actually scalar operations, having no impact. In fact, 
the speed up for inference was around 3 orders of magnitude for the CIFAR-10 subset. 

Table 2. Training Time (in seconds) 


Dataset 

IGMN 


Fast IGMN 

breast-cancer 

0 . 010 ± 

0.004 

0.006 

± 

0.000 

german-credit 

0.031± 

0.012 

0.016 

± 

0.000 

pima-diabetes 

0.013± 

0.000 

0.010 

± 

0.001 

Glass 

0.008± 

0.000 

0.005 

± 

0.000 • 

ionosphere 

0.022± 

0.002 

0.014 

± 

0.002 • 

iris 

0.005± 

0.000 

0.007 

± 

0.001 

labor-neg-data 

0.006± 

0.001 

0.007 

± 

0.001 

soybean 

0.042± 

0.004 

0.030 

± 

0.003 

twospirals 

0.005± 

0.001 

0.006 

± 

0.001 o 

MNIST 

281.257± 

3.157 

10.675 

± 

0.272 • 

CIFAR-10 

20768.494±1244.221 

175.243 

± 

1.190 • 

Average 

1754.417 


15.526 




o, • statistically significant increase or decrease in time 


Although not the focus of this work, we could also observe the accuracy of the 
IGMN algorithm on the tested datasets in comparison to other algorithms available in 
the Weka software, like Support Vector Machines (SVM) and the state-of-the-art 
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Table 3. Testing Time (in seconds) 


Dataset 

IGMN 



Fast IGMN 

breast-cancer 

O.OOli 

0.000 

0.001 

± 

0.001 

german-credit 

0.013± 

0.004 

0.002 

± 

0.000 

pima-diabetes 

0.004± 

0.000 

0.001 

± 

0.000 • 

Glass 

0.001± 

0.000 

0.001 

± 

0.000 

ionosphere 

0.010± 

0.001 

0.008 

± 

0.008 

iris 

0.000± 

0.000 

0.001 

± 

0.000 o 

labor-neg-data 

0.000± 

0.000 

0.001 

± 

0.001 

soybean 

0.026± 

0.007 

0.004 

± 

0.000 

twospirals 

0.000± 

0.000 

0.000 

± 

0.000 

MNIST 

225.262± 

5.638 

0.607 

± 

0.149 • 

CIFAR-10 

17821.407±1699.599 

7.793 

± 

0.098 

Average 

1504.097 


0.702 




o, • statistically significant increase or decrease in time 


Table 4. Area Under ROC Curve 


Dataset 

Neural Network 

1-NN 

Naive Bayes 

SVM 

IGMN 

FIGMN 

breast-cancer 

0.65± 

0.01 

0.59±0.02 

0.70±0.02 

0.61±0.04 

0.60±0.00 

0.60±0.00 

CIFAR-10b 

0.83± 

0.03 

0.611=0.19 

0.51±0.03 

0.64±0.14 

0.621=0.20 

0.621=0.20 

german-credit 

0.79± 

0.01 

0.651=0.01 • 

0.791=0.00 

0.601=0.01 

0.621=0.03 

0.621=0.03 

pima-diabetes 

0.84± 

0.00 

0.641=0.02 

0.81±0.01 

0.73±0.02 

0.691=0.03 

0.691=0.03 

Glass 

0.81± 

0.02 

0.781=0.05 

0.701=0.10 

0.72±0.12 

0.791=0.04 

0.791=0.04 

ionosphere 

0.95± 

0.03 

0.811=0.01 

0.931=0.00 

0.821=0.03 • 

0.901=0.03 

0.901=0.03 

iris 

1 . 00 ± 

0.00 

1.001=0.00 

1.001=0.00 

1.001=0.00 

1.001=0.00 

1.001=0.00 

labor-neg-data 

0.93± 

0.01 

0.791=0.07 

0.94±0.02 

0.94±0.02 

0.911=0.02 

0.911=0.02 

MNIST 

1 . 00 ± 

0.00 

0.971=0.03 

0.961=0.00 

0.971=0.04 

0.951=0.05 

0.951=0.05 

soybean 

1 . 00 ± 

0.00 

1.001=0.00 

1.001=0.00 

1.001=0.00 

1.001=0.00 

1.001=0.00 

twospirals 

0.50± 

0.08 

0.761=0.02 

0.481=0.00 

0.471=0.03 

0.611=0.12 

0.611=0.12 

Average 

0.84 


0.78 

0.80 

0.77 

0.79 

0.79 


• statistically significant degradation 


Dropout Neural Networks (17j (with 50 hidden neurons, 50% dropout for the hidden 
layer and 20% dropout for the input layer; the specific implementation can be found at 
https://github.com/amten/NeuralNetwork). It was statistically similar to them with 
/? = 0.001 (now more than one Gaussian component was allowed) and tuning the S 
parameter by 2-fold cross-validation using 3 different values (0.01, 0.1 and 1), but with 
the additional benefit of a single-scan through data, allowing it to operate on data 
streams. Results are shown in table 0] (note that, for this experiment, a smaller subset 
of CIFAR-10 was used, in order to compensate for the higher computational 
requirements of more Gaussian components). 


5 Conclusion 

We have shown how to work directly with precision matrices in the IGMN algorithm, 
avoiding costly matrix inversions by performing rank-one updates. The determinant 
computations were also avoided using a similar method, effectively eliminating any 
source of cubic complexity for the learning algorithm. This resulted in substantial 
speed ups for high-dimensional datasets, turning the IGMN into a good option for this 
kind of tasks. The inference operation still has cubic complexity, but we argue that it 
has a much smaller impact on the total runtime of the algorithm, since the number of 
outputs is usually much smaller than the number of inputs. This was confirmed for 
one-dimensional outputs, which require only scalar operations. 

In general, we could see that the fast IGMN is a good option for supervised 
learning, with low runtimes and good accuracy after adjusting the two main 
meta-parameters /3 and S. It should be noted that this is achieved with a single-pass 
through the data, making it also a valid option for data streams. 
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